Academy for Engineering and Technology, Fudan University
Digital Medical Research Center, School of Basic Medical Sciences, Fudan University
Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention
Shandong Computer Science Center (National Supercomputer Center in Jinan)
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve segmentation tasks without dense annotations. However, owing to the frequent coupling of co-occurring objects and the limited supervision from image-level labels, the challenging co-occurrence problem is widespread and leads to false activation of objects in WSSS. In this work, we devise a ‘Separate and Conquer’ scheme SeCo to tackle this issue from the dimensions of image space and feature space. In the image space, we propose to ‘separate’ the co-occurring objects via image decomposition by subdividing images into patches. Importantly, we assign each patch a category tag from Class Activation Maps (CAMs), which spatially helps remove the co-context bias and guides the subsequent representation. In the feature space, we propose to ‘conquer’ the false activation by enhancing semantic representation with multi-granularity knowledge contrast. To this end, a dual-teacher single-student architecture is designed and tag-guided contrast is conducted, which guarantees the correctness of knowledge and further enlarges the discrepancy among co-contexts. We streamline the multi-staged WSSS pipeline end-to-end and tackle this issue without external supervision. Extensive experiments validate the efficiency of our method and its superiority over previous single-staged and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available here.
Co-occurrence of objects is inevitable and often leads to false-positive pixels being activated with high probability, i.e., the model is confused by error-prone feature representations. To deal with this issue, a common practice is to introduce external supervision or human priors.
So, why not separate the coupled objects first and generate patches at the beginning?
Each patch then contains single-category information, and category-specific representation is subsequently enhanced with a dual-teacher single-student architecture.
overall architecture
Consider an input image \(\boldsymbol{I}\) containing \(K\) classes of objects \(\left\{Y_i\right\}\left(i = 1, 2, \cdots, K\right)\).
In the teacher’s \(\lambda\)th layer:
Construct auxiliary pseudo mask \(\boldsymbol{M}_\mathrm{aux}\) by:
\[\mathrm{CAM}_\mathrm{aux} = \mathrm{ReLU}\left(\boldsymbol{W}_\lambda^\mathsf{T}\boldsymbol{Z}_\mathrm{F}^\lambda\right).\]
Use \(\mathrm{CAM}_\mathrm{aux}\) to guide category tag allocation:
let \(\boldsymbol{m}_i = \mathrm{crop}\left(\boldsymbol{M}_\mathrm{aux}\right)\), and assign the category tag of \(\boldsymbol{m}_i\) to patch \(\boldsymbol{x}_i\).
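A minimal sketch of this tag allocation, assuming a regular grid crop and a dominant-class rule (the grid size, background threshold, and function name are illustrative, not the paper's exact procedure):

```python
import torch

def assign_patch_tags(cam_aux: torch.Tensor, grid: int = 2, bg_thresh: float = 0.25):
    """Assign one category tag per patch from the auxiliary CAM.

    cam_aux: (K, H, W) activation map for the K image-level classes.
    Returns a list of ((row, col), tag) pairs; the tag is the dominant class
    inside the cropped mask m_i, or -1 if nothing is activated strongly enough.
    """
    K, H, W = cam_aux.shape
    ph, pw = H // grid, W // grid
    tags = []
    for r in range(grid):
        for c in range(grid):
            m_i = cam_aux[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]  # crop(M_aux)
            score = m_i.flatten(1).mean(dim=1)               # mean activation per class
            tag = int(score.argmax()) if score.max() > bg_thresh else -1
            tags.append(((r, c), tag))
    return tags
```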
For the representation of local patches, a ViT is used to extract high-level semantics.
\(g_\mathrm{q}\) denotes the student encoder and \(g_\mathrm{k}\) the local teacher encoder. The class token of the ViT carries each patch's high-level semantics, and an MLP is then applied to strengthen the features.
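A minimal sketch of such an MLP head applied to the class token (the dimensions, layer choices, and normalization are assumptions rather than the paper's configuration):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """MLP projector applied to the ViT class token (used for both g_q and g_k)."""
    def __init__(self, in_dim: int = 768, hidden_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # Normalize so that downstream dot products act as cosine similarities.
        return nn.functional.normalize(self.mlp(cls_token), dim=-1)
```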
Decomposition spatially separates co-occurrences, but may destroy the semantic context of the patches.
Use a global teacher to extract knowledge from the entire image.
Share encoder between the global teacher and the student.
Instead of extracting semantics based on CAMs, the global teacher uses class tokens to represent high-level semantics and obtains the knowledge \(\boldsymbol{P}_l\left(l = 1, 2, \cdots, K\right)\) (i.e., class semantic centroids, also mentioned in the introduction, which help push apart co-contexts), avoiding the noise from the false localization of CAMs.
The self-attention mechanism gathers global semantics, avoiding the limitation of CAM-based methods, which are easily confused by co-occurrences when applied globally. Recall that the goal of “decomposition” is to “decouple” co-occurrences.
Note that the ViT gathers semantics within one specific image.
Example: image A yields a prototype \(a\) for class “boat” and image B yields another prototype \(b\) for the same class; the cosine similarity between \(a\) and \(b\) is computed, and a softmax over these similarity scores gives the weights \(W_l\).
Prototypes of the same class from different images thus contribute to the same global prototype.
Given the multi-class tokens \(\boldsymbol{Z}_l\) obtained from the global teacher encoder, the prototypes are updated by:
\[\boldsymbol{P}_l \leftarrow \mathrm{Norm}\left(\eta\; \boldsymbol{P}_l + \left(1 - \eta\right)\; W_l \cdot \boldsymbol{Z}_l\right).\]
This applies an exponential moving average over the existing knowledge and the weighted token, so the prototypes are updated dynamically during training.
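A sketch of this update rule, assuming the weights \(W_l\) come from a softmax over cosine similarities between the incoming class tokens and the current prototype (the temperature, per-class batching, and function name are my assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(protos: torch.Tensor, class_tokens: torch.Tensor,
                      labels: torch.Tensor, eta: float = 0.9):
    """EMA update of global prototypes P_l with similarity-weighted class tokens.

    protos:       (C, D) global prototypes, one per dataset class.
    class_tokens: (B, D) multi-class tokens Z_l from the global teacher.
    labels:       (B,)   class index each token belongs to.
    """
    tokens = F.normalize(class_tokens, dim=-1)
    for l in labels.unique():
        z = tokens[labels == l]                        # tokens of class l in this batch
        sim = z @ F.normalize(protos[l], dim=-1)       # cosine similarity to current prototype
        w = F.softmax(sim / 0.1, dim=0).unsqueeze(1)   # weights W_l over the batch tokens
        new = eta * protos[l] + (1 - eta) * (w * z).sum(dim=0)
        protos[l] = F.normalize(new, dim=-1)           # Norm(...) in the update rule
    return protos
```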
An image is cropped into \(u\) patches. \(\boldsymbol{q}_i\) is the local feature extracted by the student, and \(\boldsymbol{P}_l^+\) is the positive prototype belonging to the same category as \(\boldsymbol{q}_i\).
\[\mathcal{L}_{\mathrm{LiG}} = - \frac{1}{N_\mathrm{g}^+}\;\sum_{i = 1}^{u} \log\;\frac{\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{P}_l^+}{\tau_\mathrm{g}}\right)}{\sum_{\boldsymbol{P}_l \in \boldsymbol{P}_{\mathrm{s}}}\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{P}_l}{\tau_\mathrm{g}}\right)}\]
Force \(\boldsymbol{q}_i\) to be close to its corresponding global prototype (centroid).
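A sketch of this local-to-global contrast as a prototype-classification form of InfoNCE (assuming each patch tag indexes a prototype row and untagged patches are skipped):

```python
import torch
import torch.nn.functional as F

def lig_loss(q: torch.Tensor, tags: torch.Tensor, protos: torch.Tensor,
             tau_g: float = 0.5):
    """Local-to-global contrast: pull each patch feature q_i toward the global
    prototype of its own category and away from all other prototypes.

    q:      (u, D) patch features from the student.
    tags:   (u,)   category tag per patch; -1 marks patches to ignore.
    protos: (C, D) global class prototypes P_s.
    """
    valid = tags >= 0
    if valid.sum() == 0:
        return q.new_zeros(())
    q = F.normalize(q[valid], dim=-1)
    p = F.normalize(protos, dim=-1)
    logits = q @ p.t() / tau_g                           # similarities to every prototype
    return F.cross_entropy(logits, tags[valid].long())   # softmax over prototypes, positive = own class
```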
A category tag pool is proposed to match the memory bank.
\[B_\mathrm{q}, B_\mathrm{k} = \left(\overbrace{\boldsymbol{x}}^\text{query}, \overbrace{t}^\text{key embeddings}\right).\]
Use a queue to capture chronological information: enqueue the newest \(B_\mathrm{q}, B_\mathrm{k}\) and dequeue the oldest \(B_\mathrm{-q}, B_\mathrm{-k}\). The local teacher is updated from the student with EMA to keep the memories consistent for contrast and to avoid dramatic variance between the older memories and the newest ones in the reservoir.
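A minimal sketch of a MoCo-style memory bank paired with such a tag pool (the bank size, feature dimension, and class interface are assumptions):

```python
import torch
import torch.nn.functional as F

class TagMemoryBank:
    """FIFO reservoir of key embeddings with a matching category tag pool."""
    def __init__(self, size: int = 4096, dim: int = 256):
        self.keys = F.normalize(torch.randn(size, dim), dim=-1)
        self.tags = torch.full((size,), -1, dtype=torch.long)   # tag pool, -1 = empty / noisy
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_keys: torch.Tensor, new_tags: torch.Tensor):
        # Enqueue the newest batch; the oldest entries are overwritten (dequeued).
        n = new_keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.keys.shape[0]
        self.keys[idx] = new_keys
        self.tags[idx] = new_tags
        self.ptr = int((self.ptr + n) % self.keys.shape[0])

    def positives(self, tag: int) -> torch.Tensor:
        """All stored keys sharing the query's category tag, i.e. R(x, t)_+."""
        return self.keys[self.tags == tag]
```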
A similarity-based rectification strategy is used to denoise the tags.
Measure the similarity between \(\boldsymbol{q}_i\) and its history embeddings:
\[\mu\left(\boldsymbol{q}_i, t_i\right) = \frac{1}{\left|R\left(\boldsymbol{x}, t_i\right)_+\right|}\sum_{\boldsymbol{k}_+ \in R\left(\boldsymbol{x}, t_i\right)_+} \boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+.\]
The similarity between two patches of the same category should be significantly higher than that between patches of different categories.
Once the proportion of abnormal similarity pairs exceeds a threshold \(\sigma\), \(\boldsymbol{q}_i\) is considered a noisy embedding.
If \(\frac{1}{\left|R\left(\boldsymbol{x}, t_i\right)_+\right|}\sum_{\boldsymbol{k}_+ \in R\left(\boldsymbol{x}, t_i\right)_+} \mathbb{1}\left(\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+ \lt \mu\left(\boldsymbol{q}_i, t_i\right)\right) \gt \sigma\), then \(t_i \leftarrow -1\).
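A sketch of this rectification rule exactly as written above (the bank interface, argument names, and default \(\sigma\) are assumptions):

```python
import torch

def rectify_tag(q_i: torch.Tensor, tag_i: int, bank_keys: torch.Tensor,
                bank_tags: torch.Tensor, sigma: float = 0.5) -> int:
    """If q_i is dissimilar to too many same-tag history embeddings, mark it noisy (-1)."""
    pos = bank_keys[bank_tags == tag_i]          # R(x, t_i)_+ : same-tag history embeddings
    if pos.shape[0] == 0:
        return tag_i
    sims = pos @ q_i                             # q_i^T k_+ for every positive key
    mu = sims.mean()                             # mean similarity mu(q_i, t_i)
    abnormal = (sims < mu).float().mean()        # proportion of below-average pairs
    return -1 if abnormal > sigma else tag_i
```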
Recall that each patch has its category tag (noisy tags already rectified).
Patch-level co-category differentiation:
\[\mathcal{L}_\mathrm{LiL} = - \frac{1}{N_l^+}\sum_{i=1}^n\sum_{\boldsymbol{k}_+}M_f\log\frac{\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+}{\tau_l}\right)}{\sum_{\boldsymbol{k}^\prime \in R \left(\boldsymbol{x}, t\right)}\exp\left(\frac{\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}^\prime}{\tau_l}\right)}\]
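A sketch of this local-to-local contrast, assuming \(M_f\) masks out embeddings whose tags were rectified to \(-1\) (written as a loop for clarity; a vectorized version would be used in practice):

```python
import torch
import torch.nn.functional as F

def lil_loss(q: torch.Tensor, tags: torch.Tensor, bank_keys: torch.Tensor,
             bank_tags: torch.Tensor, tau_l: float = 0.3):
    """Local-to-local contrast: pull each patch feature toward stored keys with the
    same tag and push it away from keys of other (co-occurring) categories.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(bank_keys, dim=-1)
    loss, n_pos = q.new_zeros(()), 0
    for i in range(q.shape[0]):
        if tags[i] < 0:                                  # M_f: skip noisy embeddings
            continue
        logits = q[i] @ k.t() / tau_l                    # similarities to every key in R(x, t)
        log_prob = logits - torch.logsumexp(logits, dim=0)
        pos = bank_tags == tags[i]                       # positives share the category tag
        if pos.any():
            loss = loss - log_prob[pos].sum()
            n_pos += int(pos.sum())
    return loss / max(n_pos, 1)                          # 1 / N_l^+ normalization
```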
Loss for SeCo:
\[\mathcal{L}_\mathrm{SeCo} = \mathcal{L}_\mathrm{cls} + \mathcal{L}_\mathrm{cls}^\mathrm{aux} + \alpha\mathcal{L}_\mathrm{LiG} + \beta\mathcal{L}_\mathrm{LiL}.\]
For the overall loss, add the segmentation loss: \[\mathcal{L} = \mathcal{L}_\mathrm{SeCo} + \gamma\mathcal{L}_\mathrm{seg}.\]
comparison with SOTAs
ablation study
comparison with other recent methods
performance
Patches are of fixed size; an adaptive patch-sizing strategy might do better.